Language sensitive text classification
Authors
Abstract
It is a traditional belief that, in order to scale up to more effective retrieval and access methods, modern Information Retrieval has to take greater account of text content. The modalities and techniques suited to this objective are still under discussion. More empirical evidence is required to determine the suitable linguistic levels for modeling each IR subtask (e.g. information zoning, parsing, feature selection for indexing) and the corresponding use of this information. In this paper, an original classification model that is sensitive to document syntactic information and characterized by a novel inference method is described. Extensive experimental evidence has been derived from real test data and from well-established academic test sets. The results show that a significant improvement can be obtained using the proposed inference model. Linguistic preprocessing also appears to have a positive effect on performance: POS tagging and the recognition of proper nouns received specific experimental attention and had significant effects on measured accuracy.
Similar resources
Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents
Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...
Text Classification with Compression Algorithms
This work concerns a comparison of SVM kernel methods in text categorization tasks. In particular, I define a kernel function that estimates the similarity between two objects from their compressed lengths. In fact, compression algorithms can detect arbitrarily long dependencies within text strings. Text vectorization loses information during feature extraction and is highly sensi...
Text Screening (Censorship) in Iran: A Historical Perspective
Censorship has a long history in Iran that has interfered with text production, i.e., original writing as well as translation. This phenomenon seems to have marked the borderline between the government and the ‘enlightened’ intellectuals throughout history in Iran. Different governments have delineated ‘redlines’ for authors and translators and dealt with these constructors of culture based on ...
Semantics-based Language Models for Information Retrieval and Text Mining
Xiaohua Zhou, Xiaohua Hu. The language modeling approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. In the thesis, we propose a novel context-sensitive semantic smoothing method referred to as a topic signature language model. It extracts exp...
Publication date: 2000